Because it is a statistical language, there are a large number of probability distributions in R by default and an even larger number that can be loaded from packages. The table below gives a listing of the most common distributions in R, the name of the function within R, and the parameters of the distribution
| Distribution | R name | Parameters |
|---|---|---|
| beta | beta | shape1, shape2, ncp |
| Binomial | binom | size, prob |
| Cauchy | cauchy | location, scale |
| chi-squared | chisq | df, ncp |
| exponential | exp | rate |
| F | f | df1, df2, ncp |
| gamma | gamma | shape, scale |
| geometric | geom | prob |
| hypergeometric | hyper | m, n, k |
| log-normal | lnorm | meanlog, sdlog |
| logistic | logis | location, scale |
| Negative binomial | nbinom | size, prob |
| normal | norm | mean, sd |
| Poisson | pois | lambda |
| Student’s t | t | df, ncp |
| uniform | unif | min, max |
| Weibull | weibull | shape, scale |
| Wilcoxon | wilcox | m, n |
R actually provides four related functions for each probability distribution. These functions are called by adding a letter at the beginning of the function name. The variants of each probability distribution are:
The first argument to these functions is the same regardless of the distribution and is the value x for “d”, the quantile q for “p”, the probability p for “q”, and the sample size n for “r”
All of this will make more sense once we consider a concrete example. Let’s take a look at the normal probability density function first, since it’s the one you’re most familiar with. If you use ?dnorm you’ll see that for many of the function arguments there are default values, specifically mean=0 and sd=1. Therefore if these values are not specified explicitly in the function call R assumes you want a standard Normal distribution.
x = seq(-5,5,by=0.1)
plot(x,dnorm(x),type='l') ## that's a lowercase "L" for "line"
abline(v=0) ## add a line to indicate the mean ("v" is for "vertical")
lines(x,dnorm(x,2),col=2) ## try changing the mean ("col" sets the color)
abline(v=2,col=2)
lines(x,dnorm(x,-1,2),col=3) ## try changing the mean and standard dev
abline(v=-1,col=3)
The above plot can also be produced using R’s curve function
curve(dnorm,-5,5) ## -5,5 asks R to plot the curve in between -5 and 5
abline(v=0)
curve(dnorm(x,2),-5,5,add=TRUE,col=2) ## add=TRUE adds the new curve to the previous plot
abline(v=2,col=2)
curve(dnorm(x,-1,2),-5,5,add=TRUE,col=3) ## try changing the mean and standard dev
abline(v=-1,col=3)
You are welcome to use either approach in the following examples.
Use the command par(mfrow=c(2,3)) to have 6 plots together on the same panel. Plot the density of student’s t distribution (with location 0 and scale 1) with degrees of freedom 1,2,3,5,10,100 in 6 different plots. Overlap each plot with the density of a standard normal distribution. Use different colors for t density and normal density. Comment on your finding. (Hint: t distribution converges to normal distribution as df goes to \(\infty\)).
par(mfrow=c(2,3))
You will use density functions quite frequently to estimate the likelihood of data. For example if we collected a data point of x = 0.5 we can calculate the likelihood that this data point came from each of these three normal distributions. Implicit in the likelihood is that we’re looking at the probability of the data in a very small window, \(\delta x\), and that \(\delta x\) never changes so that:
\[Pr(x < X < x + \delta x) = \int_{x}^{x + \delta x} f(x)dx \propto f(x)\]
dnorm(0.5,0,1)
## [1] 0.3520653
dnorm(0.5,2,1)
## [1] 0.1295176
dnorm(0.5,-1,2)
## [1] 0.1505687
This shows that the first distribution has a higher likelihood than the other two, which are about the same. We interpret this as saying that the observed data point was more likely to have been generated by the first distribution than the other two. This is consistent with where the dashed blue line intersects each of the curves.
This plot of the normal distribution and the effects of varying the parameters in the normal are both probably familiar to you already - changing the mean changes where the distribution is centered while changing the standard deviation changes the spread of the distribution. Next try looking at the CDF of the normal:
plot(x,pnorm(x,0,1),type='l')
abline(v=0)
lines(x,pnorm(x,2,1),col=2)
abline(v=2,col=2)
lines(x,pnorm(x,-1,2),col=3)
abline(v=-1,col=3)
Draw the CDF of Poisson distribution with mean 0.5,1 and 2 overlapped on the same plot using the command curve. Have different colors for each curve.
Using the CDF we can calculate \(Pr(X \leq x)\) for our data point
pnorm(0.5,0,1)
## [1] 0.6914625
pnorm(0.5,2,1)
## [1] 0.0668072
pnorm(0.5,-1,2)
## [1] 0.7733726
In a bolt factory, the diameter of the bolts produced follows Cauchy distribution with center 5 cm and scale 0.1 cm. A bolt is said to be defective if it’s diameter is not within 0.3 of the center . Find the probability that a randomly selected bolt from the manufacturing process is defective.
Next let’s look at the function qnorm. Since the input to this function is a quantile, the x-values for the plot are restricted to the range [0,1]. The \(p\)th quantile \(\xi_p\) is defined as
\[ \begin{align*} \xi_p = \inf_x \{x: P(X \leq x)> p\}, \end{align*} \]
i.e. the smallest value of \(x\) for which the cdf exceeds \(p\).
p = seq(0,1,by=0.01)
plot(p,qnorm(p,0,1),type='l',ylim=c(-5,5)) # ylim sets the y-axis range
# range returns the min/max as a 2-element vector
abline(h=0) # "h" for "horizontal"
lines(p,qnorm(p,2,1),col=2)
abline(h=2,col=2)
lines(p,qnorm(p,-1,2),col=3)
abline(h=-1,col=3)
Finally, let’s investigate the rnorm function for generating random numbers that have a normal distribution. Here we generate histograms that have a progressively larger sample size and compare that to the actual density of the standard normal.
n = c(10,100,1000,10000) # sequence of sample sizes
for(i in 1:4){ # loop over these sample sizes
hist(rnorm(n[i]),main=n[i],probability=TRUE,breaks=40)
#here breaks defines number of bins in the histogram
lines(x,dnorm(x),col=2)
}
To make these plots we introduced the use of a for loop. A for loop iterates over all specified values of some index. Above, we used i as our index, and specified values 1, 2, 3, and 4 with the 1:4 syntax from last week. for loops always start this way (including parentheses): for(index in vector). The code in the { } then executes once for every value in the vector, with i taking on the next value in the list. The for syntax in R is very flexible and will allow you to loop over a vector of anything, not just integers…characters, factors, filenames, etc. Also note that your index variable can be anything, but to avoid unwanted results make sure it’s not the same as a variable you’re using elsewhere in the code (e.g. indexing by x above would ruin everything, because x would be overwritten on every iteration).
One other technical note: like any function in R that generates random output, this example will give different results every time you run it.
This example demonstrates that as the number of random draws from a probability distribution increases, the histogram of those draws provides a better and better approximation of the density itself. There are many distributions that are easier to randomly sample from than solve for analytically.
In the same setup as in Your Turn 3, you select a bolt whose diameter is on the top \(5\%\). What is the diameter?
We are next going to investigate several other common probability distributions. There will be a few examples of the effects of varying parameters for each distribution, but you are strongly recommended to explore and try other values to see how they affect the shape of each PDF. You can refer to the Wikipedia articles for most of them for a quick reference.
Note, if you didn’t start with the tutorial above, you’ll need to define x and p variables before running the codes below.
The uniform is used when there’s an equal probability of an event occurring over a range of values.
plot(x,dunif(x),type='l')
lines(x,dunif(x,0,4),col=2)
lines(x,dunif(x,-3,-0.5),col=3)
plot(x,punif(x),type='l')
lines(x,punif(x,0,4),col=2)
lines(x,punif(x,-3,-0.5),col=3)
plot(p,qunif(p),type='l',ylim=range(x))
lines(p,qunif(p,0,4),col=2)
lines(p,qunif(p,-3,-0.5),col=3)
hist(runif(500,-3,3),breaks=30,xlim=range(x),probability=TRUE)
lines(x,dunif(x,-3,3),col=2)
The Beta is an interesting distribution because it is bound on the range \(0\leq X\leq1\) and thus is very often used to describe data that is a proportion. At first glance the mathematical formula for the Beta looks a lot like the Binomial:
\[Beta(x \mid a,b) \propto x^{a-1} (1-x)^{b-1}\] \[Binom(x \mid n,p) \propto p^x (1-p)^{n-p}\]
The critical difference between the two is that in the Beta the random variable X is in the base while in the Binomial it is in the exponent. Unlike many distributions you may be have used in the past, the two shape parameters in the Beta do not define the mean and variance, but these can be calculated as simple functions of \(\alpha\) and \(\beta\). The Beta does have an interesting property of symmetry though, whereby Beta(\(\alpha\), \(\beta\)) is the reflection of Beta(\(\beta\),\(\alpha\)).
plot(p,dbeta(p,5,5),type='l')
lines(p,dbeta(p,1,1),col=2)
lines(p,dbeta(p,0.2,0.2),col=3)
## vary beta
plot(p,dbeta(p,6,6),type='l',ylim=c(0,5))
lines(p,dbeta(p,6,4),col=2)
lines(p,dbeta(p,6,2),col=3)
lines(p,dbeta(p,6,1.25),col=4)
lines(p,dbeta(p,6,1),col=5)
lines(p,dbeta(p,6,0.25),col=6)
legend("topleft",legend=c(6,4,2,1.25,1,0.5),lty=1,col=1:6)
plot(p,pbeta(p,5,5),type='l')
lines(p,pbeta(p,1,1),col=2)
lines(p,pbeta(p,0.2,0.2),col=3)
hist(rbeta(500,3,3),breaks=30,xlim=range(p),probability=TRUE)
lines(p,dbeta(p,3,3),col=2)
lines(p,dbeta(p,3,3),col=2)
Brain Teaser:
The lognormal is a log transform of the normal distribution. It is defined on the range X > 0 so is commonly used for data that cannot be negative by definition. The distribution is also positively skewed so is often used for skewed data. One thing that often goes unappreciated with the log-normal is that the mean, E[X], depends on the variance:
\[E[X] = e^{\mu + {{1}\over{2}}\sigma^2}\]
This applies not just when you explicitly use the lognormal, but also whenever you log-transform data and then calculate a mean or standard deviation - a fact that is vastly under-appreciated in the biological and environmental sciences and frequently missed in the published literature. In fact, ANY data transformation applied to make data “more normal” will change the mean, with the functional form of the bias depending on the transformation used. You can not simply back-transform the data without correcting for this bias. This phenomena is another illustration of Jensen’s Inequality.
## changing the mean
x <- 10^seq(-2,2,by=0.01)
plot(x,dlnorm(x,0),type='l',xlim=c(0,15),main="Changing the Mean")
lines(x,dlnorm(x,1),col=2)
lines(x,dlnorm(x,2),col=3)
legend("topright",legend=0:2,lty=1,col=1:3)
## on a log scale
plot(x,dlnorm(x,0),type='l',log='x',main="Log Scale")
lines(x,dlnorm(x,1),col=2)
lines(x,dlnorm(x,2),col=3)
abline(v=exp(0),col=1)
abline(v=exp(1),col=2)
abline(v=exp(2),col=3)
legend("topright",legend=0:2,lty=1,col=1:3)
## changing the variance
plot(x,dlnorm(x,2,.125),type='l',xlim=c(0,20),ylim=c(0,0.6),main="Changing the Variance")
lines(x,dlnorm(x,2,0.25),col=2)
lines(x,dlnorm(x,2,0.5),col=3)
lines(x,dlnorm(x,2,1),col=4)
lines(x,dlnorm(x,2,2),col=5)
lines(x,dlnorm(x,2,4),col=6)
abline(v=exp(2),col=1)
legend("topright",legend=c(0.125,0.25,0.5,1,2,4),lty=1,col=1:6)
## random sample
hist(rlnorm(250,2,1),breaks=30,probability=TRUE)
lines(x,dlnorm(x,2,1),col=4)
The exponential distribution arises naturally as the time it takes for an event to occur when the average rate of occurrence, r, is constant. The exponential is a special case of the Gamma (discussed next) where \(Exp(X \mid r) = Gamma(X \mid 1,r)\). The exponential is also a special case of the Weibull, \(Exp(X \min r) = Weibull(X \mid r,1)\), where the Weibull is a generalization of the exponential that allows the rate parameter r to increase or decrease with time. The Laplace is basically a two-sided exponential and arises naturally if one is dealing with absolute deviation, \(|x-m|\), rather than squared deviation, \((x-m)^2\), as is done with the normal.
## changing the mean
x <- seq(0,10,by=0.01)
plot(x,dexp(x,0.125),type='l',ylim=c(0,1))
lines(x,dexp(x,0.25),col=2)
lines(x,dexp(x,0.5),col=3)
lines(x,dexp(x,1),col=4)
lines(x,dexp(x,2),col=5)
lines(x,dexp(x,4),col=6)
legend("topright",legend=c(0.125,0.25,0.5,1,2,4),lty=1,col=1:6)
## random sample
hist(rexp(250,2),breaks=30,probability=TRUE)
lines(x,dexp(x,2),col=4)
## laplace vs Gaussian
plot(x,dexp(abs(x-5),1)/2,type='l')
lines(x,dnorm(x,5),col=2)
plot(x,dexp(abs(x-5),1)/2,type='l',log='y') ## same plot as last but on a log scale
lines(x,dnorm(x,5),col=2)
The gamma and inverse-gamma distribution are flexible distributions defined for positive real numbers. These are frequently used to model the distribution of variances or precisions (precision = 1/variance), in which case the shape and rate parameters are related to the sample size and sum of squares, respectively. The gamma is frequently used in Bayesian statistics as a prior distribution, and also in mixture distributions for inflating the variance of another distribution
x <- seq(0,10,by=0.01)
## change rate
plot(x,dgamma(x,3,3),type='l' ,ylim=c(0,1.6))
lines(x,dgamma(x,3,1),col=2)
lines(x,dgamma(x,3,6),col=3)
lines(x,dgamma(x,3,1/3),col=4)
legend("topright",legend=c(3,1,6,0.33),lty=1,col=1:4)
## change shape
plot(x,dgamma(x,3,3),type='l')
lines(x,dgamma(x,1,3),col=2)
lines(x,dgamma(x,6,3),col=3)
lines(x,dgamma(x,18,3),col=4)
legend("topright",legend=c(3,1,6,18),lty=1,col=1:4)
## change variance
a <- c(20,15,10,5,2.5,1.25)
r <- c(4,3,2,1,0.5,0.25)
plot(x,dgamma(x,20,4),type='l')
for(i in 1:6){
lines(x,dgamma(x,a[i],r[i]),col=i)}
var = a/r^2
mean = a/r
legend("topright",legend=format(var,digits=3),lty=1,col=1:6)
var
## [1] 1.250000 1.666667 2.500000 5.000000 10.000000 20.000000
mean
## [1] 5 5 5 5 5 5
## change mean
var = 4
mean = c(1,2,3,4,5)
rate = mean/var
shape = mean^2/var
plot(x,dgamma(x,shape[2],rate[2]),type='l',col=2)
for(i in 1:5){
lines(x,dgamma(x,shape[i],rate[i]),col=i)
}
legend("topright",legend=1:5,lty=1,col=1:5)
rate
## [1] 0.25 0.50 0.75 1.00 1.25
shape
## [1] 0.25 1.00 2.25 4.00 6.25
Mode of a distribution is that value(s) of \(x\) for which the density is the highest. Find the modes of
The binomial arises naturally from counts of the number of successes given a probability of success, p, and a sample size, n.
x <- 0:11
## vary size of sample (number of draws)
size = c(1,5,10)
plot(x,dbinom(x,size[1],0.5),type='s')
for(i in 2:3){
lines(x,dbinom(x,size[i],0.5),type='s',col=i)
}
legend("topright",legend=size,lty=1,col=1:3)
## vary probability
n = 10
p = c(0.1,0.25,0.5)
plot(x,dbinom(x,n,p[1]),type='s')
for(i in 2:3){
lines(x,dbinom(x,n,p[i]),col=i,type='s')
}
abline(v = n*p,col=1:3,lty=2)
legend("topright",legend=p,lty=1,col=1:3)
## CDF
plot(x,pbinom(x,n,p[1]),type='s',ylim=c(0,1))
for(i in 2:3){
lines(x,pbinom(x,n,p[i]),col=i,type='s')
}
legend("bottomright",legend=p,lty=1,col=1:3)
## Random samples
hist(rbinom(100,10,0.5)+0.0001, ## small amount added because
probability=TRUE, ## of way R calculates breaks
breaks=0:11,main="Random Binomial")
lines(x,dbinom(x,10,0.5),type='s',col=2) #Analytical solution
legend("topright",legend=c("sample","pmf"),lty=1,col=1:2)
The Poisson is also very common for count data and arises as the number of events that occur in a fixed amount of time (e.g. number of bird sightings per hour), or the number of items found in a fixed amount of space (e.g. the number of trees in a plot). Unlike the Binomial distribution the Poisson doesn’t have a fixed upper bound for the number of events that can occur.
x <- 0:12
plot(x,dpois(x,1),type='s')
lines(x,dpois(x,2),type='s',col=2)
lines(x,dpois(x,5),type='s',col=3)
legend("topright",legend=c(1,2,5),lty=1,col=1:3)
x <- 20:80
plot(x,dpois(x,50),type='s',ylim=c(0,0.08)) #Poisson with mean 50 (variance = 50)
lines(x+0.5,dnorm(x,50,sqrt(50)),col=2) #Normal with mean and variance of 50
lines(x,dbinom(x,100,0.5),col=3,type='s') #Binomial with mean 50 (variance = 25)
legend("topright",legend=c("pois","norm","binom"),col=1:3,lty=1)
plot(x,dbinom(x,100,0.5),type='s',col=3) #Binomial with mean 50 (variance = 25)
lines(x+0.5,dnorm(x,50,sqrt(25)),col=2) #Normal with mean 50 and variance of 25
lines(x,dpois(x,50),col=1,type='s') #Poisson with mean 50 (variance = 50)
legend("topright",legend=c("pois","norm","binom"),col=1:3,lty=1)
The last two panels depict a comparison of the Poisson, Normal, and Binomial with the same mean and a large sample size. The Poisson and Binomial are identical in the two figures but in the first the normal has the same variance as the Poisson and the second it is the same as the binomial.
The negative binomial has two interesting interpretations that are subtlety different from either the Poisson or Binomial. In the first case, it is the number of trials needed in order to observe a fixed number of occurrences, which is the opposite from the Binomial’s number of occurrences in a fixed trial size and thus where it gets its name. The Negative Binomial also arises as the distribution of number of events that occur in a fixed space or time when the rate is not constant (as in the Poisson) but varies according to a Gamma distribution. Hence the Negative Binomial is also used to describe data that logically seems to come from a Poisson process but has greater variability that is expected from the Poisson (which by definition has a variance equal to its mean). The Geometric distribution arises as a special case of the negative binomial where the number of occurrences is fixed at 1.
x <- 0:20
## negative binomial
## vary size
plot(x,dnbinom(x,1,0.5),type="s",main="vary size")
lines(x,dnbinom(x,2,0.5),type="s",col=2)
lines(x,dnbinom(x,3,0.5),type="s",col=3)
lines(x,dnbinom(x,5,0.5),type="s",col=4)
lines(x,dnbinom(x,10,0.5),type="s",col=5)
legend("topright",legend=c(1,2,3,5,10),col=1:5,lty=1)
## vary probability
plot(x,dnbinom(x,3,0.5),type="s",main="vary probability")
lines(x,dnbinom(x,3,0.3),type="s",col=2)
lines(x,dnbinom(x,3,0.2),type="s",col=3)
lines(x,dnbinom(x,3,0.1),type="s",col=4)
legend("topright",legend=c(0.5,0.3,0.2,0.1),col=1:5,lty=1)
## vary variance , alternate parameterization
mean = 8
var = c(10,20,30)
size = mean^2/(var-mean)
plot(x,dnbinom(x,mu=mean,size=size[1]),type="s",ylim=c(0,0.14),main="vary variance")
lines(x,dnbinom(x,mu=mean,size=size[2]),type="s",col=2)
lines(x,dnbinom(x,mu=mean,size=size[3]),type="s",col=3)
legend('topright',legend=format(c(var,mean),digits=2),col=1:4,lty=1)
lines(x,dpois(x,mean),col=4,type="s")
## NB as generalization of pois with inflated variance
## geometric
plot(x,dgeom(x,0.5),type="s",main="Geometric")
lines(x,dgeom(x,0.15),type="s",col=2)
lines(x,dgeom(x,0.05),type="s",col=3)
lines(x,dnbinom(x,1,0.15),type="s",col=4,lty=2)
## geometric as special case of NB where size = 1